ABSTRACT

Noting that improvement in rater reliability means 
eliminating differences among raters, this paper discusses ways to 
assess writing evaluator reliability and methods for achieving higher 
levels of interrater reliability. After showing that reliability can 
be improved two ways — by increasing the number of raters or 
measurements made, and by increasing the systematic variance among 
essays relative to error variance — the paper cites common problems in 
reporting and assessing reliability. The paper then recommends that
researchers (1) use an "analysis of variance" approach in assessing
reliability; (2) indicate the number of independent observations; (3)
use a two-way analysis of variance if more than one dimension is
rated; (4) use "repeated measures" analysis of variance if rating
more than one sample per student; and (5) use an "intraclass
correlation coefficient" such as coefficient alpha in reports of
research, or the "Pearson r" when two raters rate one dimension of
the sample. Finally, the paper describes methods to increase
interrater reliability such as controlling the range and quality of
sample papers, specifying the scoring task through clearly defined
objective categories, choosing raters familiar with the constructs to
be identified, and training the raters in systematic practice
sessions. (Formulas for calculating reliability and training
procedures for raters are included.) (JG)

IMPROVING INTERRATER RELIABILITY

In our presentations today on improving interrater reliability, 
we will be dividing our concerns into two major categories. First, 
I will discuss appropriate and meaningful ways to assess reliability. 
Secondly, Mary will present important considerations for achieving 
higher levels of interrater reliability. 

An informal survey we conducted on the reporting of reliability 
in the journal Research in the Teaching of English over the last five 
years has convinced us that we need to move toward more uniform and
interpretable reporting of reliability. 

Our specific recommendations today about reporting reliability 
grow out of a "theory of measurement" perspective in which reliability 
is conceptualized as variance. Specifically, reliability is defined 
as a ratio of true variance to total variance, as shown in the 
transparency (A). That is, reliability is the variance due to
real differences divided by the variance due to both real differences
and error. Measurement with zero error would then have a reliability 
of one. 

As indicated on the transparency (B), reliability is calculated
as a ratio of estimated true variance (between-product variance
minus error variance) to total variance (between-product variance
plus error variance). A correlation coefficient calculated in
this way is known as an "intraclass correlation coefficient" and
is familiar to us in the context of reliability assessment as
"coefficient alpha," also presented on the transparency (B).
Coefficient alpha or intraclass correlation coefficients can be
calculated directly from an analysis of variance summary table
as indicated on the transparency (C).

In other words, assessing reliability means subtracting out differences
among raters in order to find out how much of the total variance
is due to "true" differences among essays. The reliability
of a measure is the ratio of true or systematic variance to total
variance in ratings associated with an "average" or composite
single rater. If the ratings of all raters can be used to estimate
true scores, then the reliability estimated by the intraclass
correlation coefficient is increased to reflect the greater
stability and accuracy of rater means. The Spearman-Brown prophecy
formula estimates this increased reliability, which is due to
the utilization of multiple ratings. The calculations on the
transparency (C) show how the reliability coefficient shifts
from .74 in this case to over .90 when four raters are used rather
than just one.
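
The arithmetic on transparency (C) can be retraced in a few lines. What follows is a minimal sketch in Python, assuming the mean squares from the summary table (24.50 between essays, 2.00 within essays) and four judges per essay; the function names are ours and are only illustrative.

```python
def intraclass_r(ms_between, ms_within, k):
    """Reliability of a single rating from one-way ANOVA mean squares.

    k is the number of ratings (judges) per essay.
    """
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def spearman_brown(r_single, k):
    """Reliability of the mean of k ratings, given single-rating reliability."""
    return k * r_single / (1 + (k - 1) * r_single)


r1 = intraclass_r(24.50, 2.00, k=4)  # one judge
r4 = spearman_brown(r1, k=4)         # mean of four judges
print(round(r1, 4), round(r4, 4))    # 0.7377 0.9184
```

The two printed values match the single-judgment and four-judge coefficients worked out on the transparency.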

Now we are in a position to anticipate the kinds of suggestions 
that Mary will make for increasing the level of interrater reliability 
achieved. According to the Spearman-Brown prophecy formula, shown on
transparency (D), we have two options: one, increasing the number
of raters or measurements that we make (that is, "K" in the formula),
and two, increasing essay or product variance relative to
error variance (that is, the term r11). Besides recruiting more raters,
then, we can take steps that increase systematic variance and
decrease random or error variance in order to achieve more reliability.
Here in fact we have a recipe for increasing reliability.
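
To make the two options concrete, here is a small sketch in Python, assuming the usual Spearman-Brown form in which the reliability of K pooled ratings is K times r11 over 1 plus (K - 1) times r11. The helper `raters_needed` and the target of .90 are our own illustrative choices, not part of the transparency.

```python
import math


def spearman_brown(r_single, k):
    """Reliability of the mean of k ratings."""
    return k * r_single / (1 + (k - 1) * r_single)


def raters_needed(r_single, target):
    """Smallest whole number of raters whose pooled rating reaches target."""
    k = target * (1 - r_single) / (r_single * (1 - target))
    # The small offset guards against floating-point overshoot when k
    # is mathematically a whole number.
    return max(1, math.ceil(k - 1e-9))


# Option one: hold r11 fixed and add raters.
# Option two: raise r11 itself and see how few raters are then required.
for r11 in (0.50, 0.60, 0.70):
    pooled = [round(spearman_brown(r11, k), 2) for k in (1, 2, 4)]
    print(r11, pooled, raters_needed(r11, 0.90))
```

Reading down the output shows both levers at work: each row improves as raters are added, and a higher single-rater reliability reaches the same target with fewer raters.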

However, we want to emphasize that it is the 
analysis of variance approach that helps us see the effects 
of taking such steps as increasing the number of raters, and 
decreasing error variance by facilitating the perception 
of systematic differences. 

Instead of a composite index which reflects the impact 
of all factors affecting reliability, an analysis of variance approach 
allows us, to some extent, to examine the relative importance of
various factors contributing to achieved reliability. 

Furthermore, the very activity of setting up an analysis of 
variance table forces us to carefully identify the various components 
of our measurement design. An investigator must specify the number 
of independent observations, the number of dimensions assessed, 
and the number of products per subject included. THESE 
SAME ITEMS OF INFORMATION SHOULD ACCOMPANY ANY RELIABILITY REPORT 
IN THE LITERATURE. Otherwise, interpretation of the numerical 
value provided may be difficult. 
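
Setting up the table can be sketched as follows. This is only an illustration in Python: the ratings are hypothetical values for six essays and four judges, chosen so that the between-essay and within-essay sums of squares agree with the summary table on transparency (C), and the function name is our own.

```python
# Hypothetical ratings: 6 essays (rows) by 4 judges (columns).
RATINGS = [
    [2, 4, 3, 3],
    [5, 7, 5, 6],
    [1, 3, 1, 2],
    [7, 9, 9, 8],
    [2, 4, 6, 1],
    [6, 8, 8, 4],
]


def anova_table(ratings):
    """Return {source: (sum of squares, degrees of freedom)}."""
    n = len(ratings)       # number of essays
    k = len(ratings[0])    # number of independent observations per essay
    grand = sum(map(sum, ratings))
    correction = grand ** 2 / (n * k)

    ss_total = sum(x * x for row in ratings for x in row) - correction
    ss_essays = sum(sum(row) ** 2 for row in ratings) / k - correction
    ss_judges = sum(sum(col) ** 2 for col in zip(*ratings)) / n - correction
    ss_within = ss_total - ss_essays
    ss_resid = ss_within - ss_judges

    return {
        "between essays": (ss_essays, n - 1),
        "within essays": (ss_within, n * (k - 1)),
        "between judges": (ss_judges, k - 1),
        "residual": (ss_resid, (n - 1) * (k - 1)),
        "total": (ss_total, n * k - 1),
    }


for source, (ss, df) in anova_table(RATINGS).items():
    print(f"{source:15s} SS={ss:7.2f} df={df:2d} MS={ss / df:6.2f}")
```

Note that the function cannot run until the number of essays and the number of independent observations are fixed, which is exactly the information a reliability report should state.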

To illustrate some common difficulties that frequently arise 
in assessing reliability or "measuring measurement," I've pulled out
some typical descriptions from recent issues of Research in the
Teaching of English. I've omitted citations not just in the interest
of collegial harmony but also because these instances are not atypical;
additional examples could easily be added. On the transparency (E)
you see four statements about reliability assessment.

Beginning our discussion at the top of the list, let's examine 
the statement that "Each essay was read by at least two raters."
Although the implication that "extra" raters were used as needed seems
initially comforting, a serious problem is generated for calculating 
a useful measure of reliability in such a case. We need to know 
the exact number of raters. 

The same problem is compounded in the next example: 
"In cases where the scores differed by more than two points, a
third rater was used and the extreme score dropped."
Such an approach to rater-disagreement has the effect of 
leveling final scores and results in an information loss, as well 
as rendering problematical the meaningful assessment of reliability. 
If reliability is calculated anyway, an inflated value will result. 
Any rater-exchange must be related to factors extraneous to the rating 
situation, such as a rater dropping out because his four-year-old
has the chicken-pox. This kind of rater-switch may well lower 
reliability, but does not constitute systematic biasing. 

Our third selection, "Reliability was .97," is one of our favorites,
although it probably just represents an oversight. It is of course
essential to identify the way in which reliability was calculated.
We do not know (although we hope the researcher does) whether the .97
represents a proportion of agreement or a coefficient of some kind.

Our final quote reports that "Correlations among the three
pairs of raters were .75, .84, and .79." First, these ARE reasonable
levels; remember Diederich's forceful assertion that it's very hard
to get more than a Pearson r of .70 for two raters holistically
rating essays. However, the situation is one in which three raters 
were used, while reliability was only assessed two-raters-at-a-time, 
so reliability is probably underestimated. 
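
A rough sense of the size of the underestimate can be had with a short sketch in Python. It assumes the standardized form of coefficient alpha, which works from the average interrater correlation and treats the raters as having equal variances; the function name is ours.

```python
def composite_reliability(pairwise_rs, k):
    """Standardized alpha for k raters from their pairwise correlations."""
    mean_r = sum(pairwise_rs) / len(pairwise_rs)
    return k * mean_r / (1 + (k - 1) * mean_r)


# The three pairwise correlations quoted above, for three raters.
print(round(composite_reliability([0.75, 0.84, 0.79], k=3), 2))
```

Under these assumptions the reliability of the three raters taken together comes out above any one of the pairwise correlations.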

To address these problems and to help both researchers and 
consumers of research meaningfully interpret reliability measures, 
we make the following recommendations shown on the transparency (F): 

RECOMMENDATIONS FOR CALCULATING AND REPORTING RELIABILITY ESTIMATES 

A. Use an "analysis of variance" approach in assessing reliability. 

a. Indicate number of independent observations. If pairs of
raters confer before giving a rating, N = number of pairs.
If raters work alone while rating, even though they train
with other raters, or receive periodic feedback,
N = number of raters.

b. Number of dimensions assessed. If more than one dimension is 
rated, such as both "quality of ideas" and "correctness," use 
a two-way analysis of variance. 

c. Number of essays per student. If more than one sample of 
writing is used to estimate achievement, use "repeated measures" 
analysis of variance. 

B. Use an "intraclass correlation coefficient" such as coefficient
alpha in reports of research. In the special case of two raters
rating one dimension of the product for one product per student,
the familiar Pearson r is an equivalent measure of reliability.

Both coefficient alpha and the Pearson coefficient of correlation
can be readily generated by such widely available software
packages as SPSSX, the Statistical Package for the Social Sciences,
recently expanded.

MARY WILL NOW DISCUSS PROCEDURAL STRATEGIES FOR INCREASING INTER- 
RATER RELIABILITY. 

CALCULATING RELIABILITY

$$ r_{11} = \frac{\text{true variance}}{\text{total variance}} $$

$$ r_{11} = \frac{\text{variance between} - \text{variance within}}{\text{total variance}} $$

INTRACLASS CORRELATION COEFFICIENT

$$ \text{ICC} = \frac{\text{variance between} - \text{variance within}}{\text{variance between} + (\text{ave. \# cases per class} - 1)\,\text{variance within}} $$

COEFFICIENT ALPHA (CRONBACH'S ALPHA)

$$ \alpha = \frac{(\text{number of raters})(\text{average interrater correlation})}{1 + (\text{number of raters} - 1)(\text{average interrater correlation})} $$

THE SPEARMAN-BROWN FORMULA

$$ r_{KK} = \frac{K\, r_{11}}{1 + (K - 1)\, r_{11}} $$

RELIABILITY ASSESSMENT WHEN MORE THAN
TWO RATERS ARE USED



Numerical Example:
(See Winer, pp. 288-289)

ANALYSIS OF VARIANCE

Source of variation        SS     df      MS
Between essays          122.50     5    24.50
Within essays            36.00    18     2.00
   Between judges        17.50     3     5.83
   Residual              18.50    15     1.23
TOTAL                   158.50    23

INTRACLASS CORRELATION COEFFICIENT:

$$ r_{11} = \frac{\text{variance between} - \text{variance within}}{\text{variance between} + (\text{ave. \# cases per class} - 1)\,\text{variance within}} = \frac{24.50 - 2.00}{24.50 + (4 - 1)(2.00)} = .7377 $$

(reliability coefficient of a single judgment)

If the mean of all four judges is used, reliability is higher.
Using the Spearman-Brown formula:

$$ r_{44} = \frac{4(.7377)}{1 + (4 - 1)(.7377)} = .9184 $$

(reliability coefficient for the mean of four judgments)



A NON-RANDOM SAMPLE OF STATEMENTS ABOUT
RELIABILITY ESTIMATION TAKEN FROM THE LAST FIVE YEARS OF
RESEARCH IN THE TEACHING OF ENGLISH

1. Each essay was read by at least two raters.
2. When scores differed by more than two points,
   a third rater was used and the extreme score dropped.
3. Reliability was .97.
4. Correlations among the three pairs of raters were
   .75, .84, and .79.

A SELECTED LIST OF RESOURCES FOR RELIABILITY ISSUES:

Measurement and Reliability in Education & Composition Studies:

Asher, W.J. (1967). Measurement in educational research.
   Educational research and evaluation methods. Little, Brown.
Cooper, C.R. & Odell, L. (1977). Evaluating writing: Describing,
   measuring, judging. Urbana, IL: NCTE.
Diederich, P. (1974). Measuring growth in English.
   Champaign, IL: NCTE.
Lauer, J.M. & Asher, W.J. (1987). Composition research:
   Empirical designs. Oxford University Press.

Comprehensive Discussions of Reliability Issues in Measurement Theory:

Cronbach, L.J. (1971). Test validation. In R.L. Thorndike (Ed.),
   Educational measurement. Washington: American Council on
   Education (pp. 443-507).
Kerlinger, F.N. (1986). Reliability. Foundations of Behavioral
   Research. 3rd ed. New York: Holt, Rinehart and Winston
   (pp. 404-416).
Nunnally, J. (1979). Psychometric theory. New York: McGraw-Hill
   (Chapters 6 and 7).
Stanley, J. (1971). Reliability. In R.L. Thorndike (Ed.),
   Educational measurement. Washington: American Council on
   Education (pp. 356-442).

Statistical Procedures for Assessing Reliability:

Cronbach, L.J. (1951). Coefficient alpha and the internal
   structure of tests. Psychometrika, 16, 297-334.
Ebel, R.L. (1951). Estimation of the reliability of ratings.
   Psychometrika, 16, 407-424.
SPSSX. (1986). User's Guide, 2nd ed. Chicago: McGraw-Hill.
   (The Statistical Package for the Social Sciences provides both
   software and extended discussion of the available options.)
Winer, B.J. (1971). Statistical principles in experimental design.
   New York: McGraw-Hill (pp. 283-289).



PROCEDURES FOR SECURING HIGH RELIABILITY 

In the previous section the Intraclass Correlation
Coefficient was presented as the most visibly clear means of
calculating reliability; also mentioned were problems of
interpreting results when researchers do not specify how
reliability was obtained. This section will address methods for
improving interrater reliability. These methods can be applied
during or after training sessions, but they are best used as
preparation for rating.

WRITTEN PRODUCTS 

Content analysis experts like Krippendorf (1980) and Holsti
(1969) advise analyzing only well-defined writing tasks. Here
the composition researcher is in trouble since the essay has
multiple ways of being developed and organized. (See DeShields,
Hsieh, and Frost (1984) for more on essay grading and
reliability.) Nonetheless, the researcher can still take the
precaution of removing any essays that clearly do not respond to
the task. For example, if a persuasive essay was assigned and a
student produced an expressive essay, the researcher should
remove that essay and not force raters to identify a construct in
it that probably doesn't exist. The confusion that would result
from trying to score this essay introduces unsystematic error,
thereby lowering the reliability. In the previous section, it
was stated that reliability is the ratio of true variance to
total variance. Increasing the denominator with error variance
yields a smaller reliability figure.

Another precaution to take before giving essays to raters
would be to check for restriction of range. Although you may
wish to generalize to a population of all freshmen writers, your
particular student body may not represent them fully. For
example, if your admission standards are very high, your freshmen
may not reveal national trends even though you have a variety of
types of writers in your classes. Scores tend to cluster
around a few categories, and raters may have a hard time
distinguishing among papers. Raters dutifully try to
stretch the papers over the scale, yet they end up quibbling over
small details they never would have seen in a more representative
sample. In other words, restriction of range means insufficient
product variance. Without enough variation in the products,
findings are restricted and reliability can be lowered because of
rater confusion.
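
The effect of restricted range follows directly from the definition of reliability as true variance over total variance. The sketch below, in Python, assumes a constant rater error variance of 2.0 and a handful of made-up true (between-essay) variances; all of the numbers are illustrative only.

```python
def reliability(true_variance, error_variance):
    """Ratio of true variance to total (true plus error) variance."""
    return true_variance / (true_variance + error_variance)


ERROR = 2.0  # assumed: rater error stays the same as the range narrows
for true_var in (8.0, 4.0, 2.0, 1.0):
    print(f"true variance {true_var:4.1f} -> reliability "
          f"{reliability(true_var, ERROR):.2f}")
```

With the error held fixed, each narrowing of the range of essays lowers the coefficient, even though the raters are no less careful than before.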
MEASUREMENT INSTRUMENT

The measurement instrument is really the rater, a person 
sorting written products according to the categories assigned by 
the researcher. Therefore, the issue of reliability is bound up 
in many factors. Before discussing the human factors, let us 
consider the scoring task. Categories should be clearly defined. 
Raters should be told the basic unit of text to be classified, be
it a word, paragraph, theme, essay, or otherwise (Weber, 1985:
22-23). The more objective the scoring task, the higher the
reliability because an easy task promotes greater systematic
variance (Nunnally, 1971).

The number of categories also influences the reliability.
The decision of how many to include depends on how many the
raters can perceive. Using the maximum number of categories that
raters can perceive gives you the most information about your
essays, maximizing systematic variance. For example, if raters
can perceive 5 categories of audience-adaptation, and those
categories are well defined and easily scored, the researcher
will then have more information about the essays; that is,
between-essay variance is increased. Raters will tend to achieve
higher reliability with 5 categories than if only 3 categories
were used.

* Since reliability can be expressed as 1 - (error variance /
total variance), we see that restriction of range ("falsely low"
total variance) results in a larger quantity subtracted from unity
and, consequently, an attenuated estimate of reliability.

RATER SELECTION 

Since your measurement instrument is the rater, and you need
the most precise judgment possible, you want your raters to be
experts by selection and by training. Therefore choose raters
familiar with the construct you wish to identify. I suggest
environmental training methods much like Hillocks' (1984) mode of
composition instruction; namely, a session where you elicit
raters' preconceptions about that construct and then build a
fresh notion together using their preconceptions and your
definitions. If you see a rater who cannot or will not adjust
his or her preconceptions to match your scoring task, do not use
that rater. Such raters simply increase unsystematic variance by their
inability to internalize that scoring system. SPSSX Reliability software
helps you detect such raters. It deletes each rater and figures
a subsequent reliability (alpha). When a person is deleted and
the alpha increases, you know who to remove. You may also wish
to figure the reliability of various groups of raters.
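
The delete-one-rater check can be sketched in a few lines of Python. This is an illustration of the idea rather than a reproduction of what SPSSX does internally: the ratings are hypothetical (six essays by four judges), the function names are ours, and alpha is computed from rater variances and the variance of essay totals.

```python
from statistics import variance

# Hypothetical ratings: 6 essays (rows) by 4 judges (columns).
RATINGS = [
    [2, 4, 3, 3],
    [5, 7, 5, 6],
    [1, 3, 1, 2],
    [7, 9, 9, 8],
    [2, 4, 6, 1],
    [6, 8, 8, 4],
]


def cronbach_alpha(ratings):
    """Coefficient alpha with raters as columns and essays as rows."""
    k = len(ratings[0])
    rater_vars = sum(variance(col) for col in zip(*ratings))
    total_var = variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - rater_vars / total_var)


def alpha_if_deleted(ratings):
    """Alpha recomputed with each rater (column) left out in turn."""
    k = len(ratings[0])
    return [
        cronbach_alpha([[x for j, x in enumerate(row) if j != drop]
                        for row in ratings])
        for drop in range(k)
    ]


full = cronbach_alpha(RATINGS)
print(f"all judges: alpha = {full:.3f}")
for judge, a in enumerate(alpha_if_deleted(RATINGS), start=1):
    note = "  <- alpha rises without this judge" if a > full else ""
    print(f"without judge {judge}: alpha = {a:.3f}{note}")
```

Any judge flagged in the output is a candidate for retraining or removal; with only six essays, such a flag should be treated as a prompt for a closer look rather than as a verdict.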

Adding raters can dramatically increase the reliability if 
the scoring task is fairly objective. The Spearman-Brown 
Prophecy Formula indicates this fact and reveals that the number
of raters is perhaps the most easily adjusted factor a researcher
has control of before and after rating. The reason why
additional raters can help so much is that as raters are added,
systematic variance accumulates faster than non-systematic
variance, or error. If the scoring task is too vague, however,
increasing the number of raters will probably not affect the
reliability.
TRAINING PROCEDURES 

The goal of training is to build a firm knowledge base. 
Therefore, the researcher should begin training with sets of 
anchor papers for each of the categories on the scale. Once 
raters have a firm grasp of each category, then they can begin 
practice rating. It would be a mistake to hand raters a mixed 
pile of essays before they achieve this grasp of their task. 
Systematic variance, or rater agreement, is enhanced greatly by 
this firm notion of each category. 

Early in training also, the following types of rater errors
should be discussed (Corsini, 1984: 209-206):

- halo (one trait influences the scoring;
  balloon handwriting, for example)
- carryover (knowledge of student's ability;
  knowing your students, esp. at a small school)
- central tendency (hesitancy to score on extremes)
- sequential (order of papers affects scoring)
- recency (emotional influences on raters;
  death of the Challenger crew, personal things)

When raters make these kinds of errors, their scoring
patterns become erratic, thus lowering the reliability.

Finally, you are ready for practice sessions after the
knowledge base is firm and errors have been discussed. Be firm
about resolving differences as this builds knowledge. Raters
should confer and adjust scores during training sessions only, or
if their original scores will be retained. Hillocks (1983)
provides discussion of this procedure. In the previous section,
it was noted how important rater independence is; if raters
confer there is no interrater reliability since only one
conglomerate score exists.

As a final word, we would like to note how important 
conditions are for rating. Training and rating sessions should 
be short, about 2 hours, to avoid fatigue. Refreshments should
be provided; raters should be paid a fair wage. Copies of papers 
should be dark enough and legible. We would also suggest using 
pairs of raters to add to the interest and reliability of the 
scoring task. 

In closing, we suggest using the above methods for securing 
higher reliability before the rating sessions and using the 
Intraclass Correlation Coefficient for assessing reliability. 